BAYESIAN MULTIVARIATE MIXED-SCALE DENSITY ESTIMATION

By Antonio Canale and David B. Dunson
Università degli Studi di Padova and Duke University

Although univariate continuous density estimation has received
abundant attention in the Bayesian nonparametrics literature, there
is essentially no theory on multivariate mixed scale density
estimation. In this article, we consider a general framework to jointly
model continuous, count and categorical variables under a nonparametric
prior, which is induced through rounding latent variables having an
unknown density with respect to Lebesgue measure. For the proposed
class of priors, we provide sufficient conditions for large support,
strong consistency and rates of posterior contraction. These
conditions, which primarily relate to the prior on the latent variable
density and heaviness of the tails for the observed continuous
variables, allow one to convert sufficient conditions obtained in the
setting of multivariate continuous density estimation to the mixed
scale case. We provide new results in the multivariate continuous
density estimation case, showing the Kullback-Leibler property and
strong consistency for different mixture priors including priors that
parsimoniously model the covariance in a multivariate Gaussian mixture
via a sparse factor model. In particular, the results hold for Dirichlet
process location and location-scale mixtures of multivariate Gaussians
with various prior specifications on the covariance matrix.

1. Introduction. It is routine in many application domains to collect 
multivariate mixed scale data consisting of binary, categorical, continuous 
and count measurements, motivating an increasingly rich literature on meth- 
ods for analysis of such data. Perhaps the most common approach in prac- 
tice is to link each observed variable to one or more underlying Gaussian 
variables. Relationships among the underlying Gaussian variables are typi- 
cally characterized through latent factor or structural equation models as in 
Muthén [23]. Although the underlying Gaussian class is convenient
computationally and in terms of providing an interpretable framework for
studying dependence among mixed scale variables, the flexibility is
limited in implying Gaussian distributions for continuous variables,
probit models for categorical variables and a restrictive dependence
structure. In addition, issues
arise in modeling counts and categorical variables having very many levels
due to the need to introduce and do computation for very many threshold 
parameters. An alternative class of joint models for mixed scale data can 
be induced by defining a separate generalized linear model for each vari- 
able, with shared latent variables included within these models to induce 
dependence [26; 22; 5; 6]. This framework assumes that observed variables 
are independently drawn from distributions in the exponential family con- 
ditionally on latent variables. In marginalizing out the latent variables, one 
obtains a multivariate distribution with essentially unknown properties and 
computation can be quite challenging. In certain cases, pitfalls can arise due 
to the dual role of the latent factors in controlling the dependence structure 
and the shape of the marginal distributions. 

Given these issues, it is quite appealing to consider nonparametric mod- 
els for flexibly and parsimoniously estimating unknown joint distributions 
for mixed scale data. Somewhat surprisingly given the considerable applied 
interest, the literature on nonparametric estimation for mixed scale data 
is very small. From a frequentist kernel smoothing perspective, Li, Racine 
and co-authors [18; 17; 24; 19] proposed mixed kernel methodology and 
considered properties under somewhat restrictive conditions. Efromovich [7] 
recently relaxed these conditions and proposed a data-driven estimator de- 
signed to combat the curse of dimensionality. However, his work still as- 
sumed compact support for continuous variables and bounded support for 
discrete variables. To our knowledge, there has been no work on general 
nonparametric Bayes mixed data density estimation. 

We note that one can potentially combine two existing models to obtain 
a seemingly flexible and computationally tractable model for mixed scale 
densities, which also addresses the curse of dimensionality problem. Namely, 
we can combine the class of underlying Gaussian models mentioned above 
with mixture of factor analyzers, which characterize multivariate continuous 
densities as mixtures of Gaussians with a factor analytic factorization of 
the covariance [11; 4; 21]. A conceptually related approach was proposed 
by Yang and Dunson [34], but instead of mixing Gaussian factor models, 
they used a nonparametric Bayes approach to allow unknown latent variable 
distributions in structural equation models that accommodate relationships 
among the latent variables. To define a nonparametric model for counts, 
Canale and Dunson [3] proposed to round an underlying variable having 
an unknown density given a Dirichlet process mixture (DPM) of Gaussians 
prior [20; 8]. 

Our focus is on defining classes of Bayesian models for mixed scale
densities, which are computationally convenient and can be shown to have
appealing theoretical properties, such as large support, posterior consistency
and near optimal rates of convergence. Instead of developing fundamentally
new theoretical tools for the study of mixed scale densities, our goal is to
provide theorems that make it possible to leverage results obtained for
multivariate continuous densities. With this goal in mind, we focus on a multivariate
mixed scale generalization of the rounding framework of [3]. In particular, 
we propose to induce a prior on a mixed scale density by defining a prior on a 
multivariate continuous density for underlying variables, which are rounded 
to induce categorical or count measurements. This framework has the ap- 
pealing feature of avoiding computation for thresholds, greatly facilitating 
handling of discrete variables having many levels. Theoretical properties de- 
pend crucially on the prior for the underlying continuous density. 

For modeling continuous densities, standard nonparametric Bayes methods
rely on mixture models of the form

\[ f(y; P) = \int K(y; \theta)\, dP(\theta), \]

where $K(\cdot; \theta)$ is a known probability kernel having parameters $\theta$ (e.g.,
Gaussian with $\theta$ including the mean and covariance) and $P$ is an unknown mixing
measure assigned a prior $\Pi$. A common choice for $\Pi$ is the Dirichlet process
of Ferguson [9; 10].

Ghosal, Ghosh and Ramamoorthi [12] derive sufficient conditions on the
prior and the true distribution $f_0$ in order to achieve strong posterior
consistency under a DPM of univariate Gaussians. Tokdar [30] relaxed their
conditions assuming a Dirichlet process location-scale mixture of univariate 
Gaussians. Ghosal and van der Vaart [14; 15] give the rate of convergence 
for Bayesian univariate density estimation using a DPM of Gaussians. 

The only results (to our knowledge) on asymptotic properties of Bayesian 
procedures for multivariate continuous density estimation are presented by 
Ghosal and co-authors [33; 29]. In both papers the models considered are 
quite limited in scope in focusing on DP location mixtures of Gaussian
kernels. Posterior consistency is studied in [33] assuming a truncated
inverse-Wishart prior for the Gaussian covariance. In [29] near minimax optimal
rates of posterior contraction are shown under some conditions on the true
density assuming a diagonal covariance in the Gaussian kernel with
independent truncated inverse-gamma priors on the diagonal elements. In practice,
it is well known that using a diagonal covariance may lead to less efficient
results in small to moderate samples. In addition, it is preferable to avoid
arbitrary truncations and allow broader priors than inverse gammas and
inverse Wisharts. For example, for high-dimensional data it is well known
that inverse Wisharts provide a poor choice and alternatives based on factor
analytic and other factorizations are commonly used.

In the next section we describe the space of mixed scale densities and
introduce priors on this space via priors on a latent space and mapping
functions. The section presents theorems on the KL support of the prior,
strong posterior consistency and rates of posterior contraction. Given the
lack of results in the multivariate continuous density estimation context we
also generalize the results of [33] using a modification of the sieve suggested
by Pati, Dunson and Tokdar [25]. Such a construction avoids the explosion of
the $L_1$-metric entropy noted by [33] and allows us to obtain strong posterior
consistency also in multivariate density estimation under a location
scale-association mixture model. Proofs not given in the text are reported in the
Appendix.

2. Mixed-scale densities. 

2.1. Preliminaries. Our focus is on modeling of joint probability
distributions of mixed scale data $y = (y_1^T, y_2^T)^T$, where
$y_1 = (y_{1,1}, \ldots, y_{1,p_1}) \in \mathbb{R}^{p_1}$ is a $p_1 \times 1$ vector of continuous
observations and $y_2 = (y_{2,p_1+1}, \ldots, y_{2,p}) \in \mathcal{Q}$ with
$\mathcal{Q} = \prod_{j=1}^{p_2} \{0, 1, \ldots, q_j - 1\}$ is a $p_2 \times 1$ vector of discrete variables having
$q = (q_1, \ldots, q_{p_2})^T$ as the respective number of levels and $p_2 = p - p_1$. Clearly
$y_2$ can include binary variables ($q_j = 2$), categorical variables ($q_j > 2$) or
counts ($q_j = \infty$). Hence, $y$ is a $p \times 1$ vector of variables having mixed
measurement scales. We let $y \sim f$, with $f$ denoting the joint density with respect
to an appropriate dominating measure $\mu$ to be defined below. The set of all
possible such joint densities is denoted $\mathcal{F}$. Following a Bayesian
nonparametric approach, we propose to specify a prior $f \sim \Pi$ for the joint density
having large support over $\mathcal{F}$.

For the continuous variables, we let $(\Omega_1, \mathcal{S}_1, \mu_1)$ denote the $\sigma$-finite
measure space having $\Omega_1 = \mathbb{R}^{p_1}$, $\mathcal{S}_1$ the Borel $\sigma$-algebra of subsets of
$\Omega_1$, and $\mu_1$ the Lebesgue measure. Similarly for the discrete variables we let
$(\Omega_2, \mathcal{S}_2, \mu_2)$ denote the $\sigma$-finite measure space having $\Omega_2 \subset \mathbb{N}^{p_2}$, a subset of
the $p_2$-dimensional set of natural numbers, $\mathcal{S}_2$ containing all non-empty subsets of
$\Omega_2$, and $\mu_2$ the counting measure. Then, we let $\mu = \mu_1 \times \mu_2$ be the product
measure on the product space $(\Omega, \mathcal{S}) = (\Omega_1, \mathcal{S}_1) \times (\Omega_2, \mathcal{S}_2)$. To formally
define the joint density $f$, first let $\nu$ denote a $\sigma$-finite measure on $(\Omega, \mathcal{S})$ that
is absolutely continuous with respect to $\mu$. Then, by the Radon-Nikodym
theorem there exists a function $f$ such that $d\nu = f\, d\mu$.

In studying properties of a prior $\Pi$ for the unknown density $f$, such as
large support and posterior consistency, it is necessary to define notions of
distance and neighborhoods within the space of densities $\mathcal{F}$. Letting $f_0 \in \mathcal{F}$
denote an arbitrary density, such as the true density that generated the data,
the Kullback-Leibler divergence of $f$ from $f_0$ can be defined as

\[
\begin{aligned}
d_{KL}(f_0, f) &= \int f_0 \log(f_0/f)\, d\mu = \int\!\!\int f_0 \log(f_0/f)\, d\mu_1\, d\mu_2 \\
&= \int_{\mathbb{R}^{p_1}} \sum_{y_2 \in \mathcal{Q}} f_0(y_1, y_2) \log\left( \frac{f_0(y_1, y_2)}{f(y_1, y_2)} \right) d\mu_1(y_1),
\end{aligned}
\]

with the integrals taken in any order from Fubini's theorem. Another
topology is induced by the $L_1$-metric. If $f$ and $f_0$ are probability densities
with respect to the product measure $\mu$, their $L_1$-distance is defined as

\[
\begin{aligned}
d_1(f_0, f) &= \int |f_0 - f|\, d\mu = \int\!\!\int |f_0 - f|\, d\mu_1\, d\mu_2 \\
&= \int_{\mathbb{R}^{p_1}} \sum_{y_2 \in \mathcal{Q}} |f_0(y_1, y_2) - f(y_1, y_2)|\, d\mu_1(y_1).
\end{aligned}
\]
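
As a minimal numerical sketch of the two distances for mixed densities, the following Python code sums over the discrete coordinate (counting measure) and integrates over the continuous coordinate (Lebesgue measure) with a trapezoidal rule. The toy density, with one continuous variable $y_1$ and one binary variable $y_2$, and all function names are assumptions made for this illustration only; they are not part of the paper.

```python
import numpy as np

def normal_pdf(x, mean, sd):
    """Univariate normal density evaluated on an array x."""
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def mixed_density(grid, weights, means):
    """Toy mixed density f(y1, y2) = weights[y2] * N(y1; means[y2], 1).

    Rows index the discrete level y2, columns index the grid of y1 values.
    """
    return np.array([w * normal_pdf(grid, m, 1.0) for w, m in zip(weights, means)])

def integrate(values, grid):
    """Trapezoidal rule over the continuous coordinate y1."""
    return np.sum(0.5 * (values[1:] + values[:-1]) * np.diff(grid))

def l1_distance(f0, f, grid):
    # Sum over y2 (counting measure), then integrate over y1 (Lebesgue measure).
    return integrate(np.abs(f0 - f).sum(axis=0), grid)

def kl_divergence(f0, f, grid):
    # Same order of operations, applied to f0 * log(f0 / f).
    return integrate((f0 * np.log(f0 / f)).sum(axis=0), grid)

grid = np.linspace(-12.0, 12.0, 4801)
f0 = mixed_density(grid, weights=(0.3, 0.7), means=(0.0, 1.0))
f = mixed_density(grid, weights=(0.5, 0.5), means=(0.0, 1.0))
print(l1_distance(f0, f, grid), kl_divergence(f0, f, grid))
```

In this toy case the two densities share the same conditional distributions of $y_1$ given $y_2$, so both quantities reduce to their discrete counterparts for the marginal probabilities of $y_2$.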

2.2. Rounding prior. In order to induce a prior $f \sim \Pi$ for the density of
the mixed scale variables, we let

\[ y = h(y^*), \qquad y^* \sim f^*, \qquad f^* \sim \Pi^*, \tag{1} \]

where $h : \mathbb{R}^p \to \Omega$, $y^* = (y_1^*, \ldots, y_p^*)^T \in \mathbb{R}^p$, $f^* \in \mathcal{F}^*$, $\mathcal{F}^*$ is the set of densities
with respect to Lebesgue measure over $\mathbb{R}^p$, and $\Pi^*$ is a prior over $\mathcal{F}^*$. To
introduce an appropriate mapping $h$, we let

\[ h(y^*) = \left( h_1(y_1^*)^T, h_2(y_2^*)^T \right)^T, \tag{2} \]

where $h_1(y_1^*) = y_1^*$ is the identity function and $h_2$ are thresholding
functions that replace the real-valued inputs with non-negative integer outputs
by thresholding the different inputs separately. Let $A^{(j)} = \{A_1^{(j)}, \ldots, A_{q_j}^{(j)}\}$
denote a prespecified partition of $\mathbb{R}$ into $q_j$ mutually exclusive subsets, for
$j = 1, \ldots, p_2$, with the subsets ordered so that $A_h^{(j)}$ is placed before $A_l^{(j)}$ for
all $h < l$. Then, letting $A_{y_2} = \{y_2^* : y_{2,p_1+j}^* \in A_{y_{2,p_1+j}+1}^{(j)},\ j = 1, \ldots, p_2\}$, the mixed
scale density $f$ is defined as

\[ f(y) = g(f^*) = \int_{A_{y_2}} f^*(y^*)\, dy_2^*. \tag{3} \]

The function $g : \mathcal{F}^* \to \mathcal{F}$ defined in (3) is a mapping from the space of
densities with respect to Lebesgue measure on $\mathbb{R}^p$ to the space of
mixed-scale densities $\mathcal{F}$.
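
The rounding map $h$ can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the cells of each partition are taken to be half-open intervals defined by increasing interior cut points, and the function name `round_latent`, the cut points and the latent draws are hypothetical choices, not taken from the paper.

```python
import numpy as np

def round_latent(y_star, p1, cutpoints):
    """Apply a map of the form (2): identity on the first p1 coordinates,
    coordinate-wise thresholding on the remaining ones.

    cutpoints[j] holds the increasing interior boundaries a_1 < ... < a_{q_j - 1}
    for the j-th discrete variable, so its cells are assumed to be
    (-inf, a_1], (a_1, a_2], ..., (a_{q_j - 1}, inf) and its levels 0, ..., q_j - 1.
    """
    y_star = np.atleast_2d(y_star)
    y1 = y_star[:, :p1]  # continuous block, left unchanged
    y2 = np.column_stack([
        np.searchsorted(cuts, y_star[:, p1 + j], side="left")
        for j, cuts in enumerate(cutpoints)
    ])  # index of the cell containing each latent value
    return y1, y2.astype(int)

# Hypothetical example: one continuous, one binary and one five-level variable.
rng = np.random.default_rng(0)
latent = rng.multivariate_normal(mean=[0.0, 0.0, 2.0], cov=np.eye(3), size=5)
cuts = [np.array([0.0]), np.array([0.0, 1.0, 2.0, 3.0])]
continuous_part, discrete_part = round_latent(latent, p1=1, cutpoints=cuts)
print(continuous_part.shape, discrete_part)
```

No threshold parameters are estimated here: the cut points are fixed in advance, and all flexibility would come from the density of the latent variables.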

This framework generalizes [3], which focused only on count variables.
The theory is substantially more challenging in the mixed scale case when
there are continuous variables involved. Clearly the properties of the induced
prior $f \sim \Pi$ will be driven largely by the properties of $f^* \sim \Pi^*$. Lemma
1 shows that the mapping $g : \mathcal{F}^* \to \mathcal{F}$ maintains Kullback-Leibler (KL)
neighborhoods. The proof is omitted as being a straightforward modification
of that for Lemma 1 in [3].

Lemma 1. Choose any $f_0^*$ such that $f_0 = g(f_0^*)$ for any fixed $f_0 \in \mathcal{F}$.
Let $\mathcal{K}_\epsilon(f_0^*) = \{f^* : d_{KL}(f_0^*, f^*) < \epsilon\}$ be a Kullback-Leibler neighborhood of
size $\epsilon$ around $f_0^*$. Then the image $g(\mathcal{K}_\epsilon(f_0^*))$ contains values $f \in \mathcal{F}$ in a
Kullback-Leibler neighborhood of $f_0$ of at most size $\epsilon$.
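
A small numerical check can illustrate the content of Lemma 1 in the simplest case of a single count variable. The sketch below compares the KL divergence between two latent Gaussian densities with the KL divergence between the probability mass functions obtained by rounding them. The two Gaussians, the unit-width cells and the truncation to 15 levels (with an open-ended last cell) are assumptions made for this example.

```python
import math

def normal_tail(x, mean, sd):
    """P(Y* > x) for Y* ~ N(mean, sd^2), via erfc for accuracy in the tail."""
    return 0.5 * math.erfc((x - mean) / (sd * math.sqrt(2.0)))

def rounded_pmf(mean, sd, n_levels):
    """pmf of y = h(y*) for cells (-inf, 0], (0, 1], ..., (n_levels - 2, inf)."""
    tails = [1.0] + [normal_tail(j, mean, sd) for j in range(n_levels - 1)] + [0.0]
    return [tails[j] - tails[j + 1] for j in range(n_levels)]

def kl_normal(m0, s0, m1, s1):
    """Closed-form KL divergence of N(m1, s1^2) from N(m0, s0^2)."""
    return math.log(s1 / s0) + (s0 ** 2 + (m0 - m1) ** 2) / (2.0 * s1 ** 2) - 0.5

def kl_discrete(p0, p):
    """KL divergence between two pmfs on the same finite set of levels."""
    return sum(a * math.log(a / b) for a, b in zip(p0, p) if a > 0.0)

kl_latent = kl_normal(2.0, 1.0, 2.5, 1.5)
kl_rounded = kl_discrete(rounded_pmf(2.0, 1.0, 15), rounded_pmf(2.5, 1.5, 15))
print(kl_latent, kl_rounded)
```

The rounded divergence should not exceed the latent one, which is the mechanism that lets KL neighborhoods of $f_0^*$ map into KL neighborhoods of $f_0$.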

Large support of the prior plays a crucial role in posterior consistency.
Under the theory of Schwartz [27], given $f_0$ in the KL support of the prior,
to obtain posterior consistency we need to ensure the existence of an
exponentially consistent sequence of tests for the hypothesis $H_0 : f = f_0$ versus
$H_1 : f \in U^C(f_0)$, where $U(f_0)$ is a neighborhood of $f_0$. Ghosal et al. [12] show
that the existence of such a sequence of tests is guaranteed by balancing the
size of a sieve and the prior probability assigned to its complement.

We now provide sufficient conditions for $L_1$ posterior consistency for priors
in the class proposed in expression (1). Our Theorem 1 builds on Theorem 8
of [12]. The main differences are that we define the sieve $\mathcal{F}_n$ as $g(\mathcal{F}_n^*)$, where
$\mathcal{F}_n^*$ is a sieve on $\mathcal{F}^*$, and that we require conditions on the prior probability in
terms of the underlying $\Pi^*$. The proof relies on the same steps of [12] given
Lemmas 4 and 5 (reported in the Appendix), which give an upper bound for
the $L_1$ metric entropy $J(\delta, \mathcal{F}_n^*)$, defined as the logarithm of the minimum
number of $\delta$-sized $L_1$ balls needed to cover $\mathcal{F}_n^*$.

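Theorem 1. Let $\Pi$ be a prior on $\mathcal{F}$ induced by $\Pi^*$ as described in
expression (1). Suppose $f_0$ is in the KL support of $\Pi$ and let
$U = \{f \in \mathcal{F} : \|f - f_0\| < \epsilon\}$. If for each $\epsilon > 0$, there is a $\delta < \epsilon$, $c_1, c_2 > 0$,
$\beta < \epsilon^2/8$ and there exist sets $\mathcal{F}_n^* \subset \mathcal{F}^*$ such that for $n$ large

(i) $\Pi^*(\mathcal{F}_n^{*C}) < c_1 e^{-n c_2}$;
(ii) $J(\delta, \mathcal{F}_n^*) < n\beta$,

then $\Pi(U \mid y_1, \ldots, y_n) \to 1$ a.s. $P_{f_0}$.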

We now state a theorem on the rate of convergence (contraction) of the
posterior distribution. The theorem gives conditions on the prior $\Pi^*$ similar
to those directly required by Theorem 2.1 of [13]. The proof is reported in
the Appendix.

Theorem 2. Let $\Pi$ be the prior on $\mathcal{F}$ induced by $\Pi^*$ as described in
expression (1) and $U = \{f : d(f, f_0) < M\epsilon_n\}$ with $d$ the $L_1$ or Hellinger
distance. Suppose that for a sequence $\epsilon_n$, with $\epsilon_n \to 0$ and $n\epsilon_n^2 \to \infty$,
a constant $C > 0$, sets $\mathcal{F}_n^* \subset \mathcal{F}^*$ and
$B_n^* = \{f^* : \int f_0^* \log(f_0^*/f^*)\, d\mu \le \epsilon_n^2,\ \int f_0^* (\log(f_0^*/f^*))^2\, d\mu \le \epsilon_n^2\}$
defined for a given $f_0^* \in g^{-1}(f_0)$, we have

(iii) $J(\epsilon_n, \mathcal{F}_n^*) \le C n \epsilon_n^2$;
(iv) $\Pi^*(\mathcal{F}_n^{*C}) \le \exp\{-n\epsilon_n^2 (C + 4)\}$;
(v) $\Pi^*(B_n^*) \ge \exp\{-C n \epsilon_n^2\}$;

then for sufficiently large $M$, we have that $\Pi(U^C \mid y_1, \ldots, y_n) \to 0$ in
$P_{f_0}$-probability.

3. Consistency in multivariate continuous density estimation. 

From Section 2.2 it is clear that the properties of the induced prior
$f = g(f^*) \sim \Pi$ depend heavily on the choice of prior $f^* \sim \Pi^*$. Our hope is
to leverage the rich literature on models and theory for continuous density
estimation in developing associated models and theory for the mixed scale
case. A first step in utilizing the methodology and theory developed in
Sections 2.1-2.2 is to define priors for unknown multivariate densities $f^*$ with
respect to Lebesgue measure on $\mathbb{R}^p$ and verify that these priors have
appealing properties in terms of large support and posterior consistency in the
simple case in which $p_1 = p$, so that $y = y^*$ and all underlying variables are
observed directly.

Dirichlet process mixtures (DPMs) form the most widely applied class 
of models for Bayesian density estimation, with a rich theoretical litera- 
ture available in the univariate continuous case on posterior consistency 
[12; 1; 30; 32] and rates of posterior contraction [13; 14; 15; 31; 28]. DPMs 
of Gaussian kernels have proven successful for multivariate density
estimation in challenging cases involving high-dimensional data [4]. However, to
our knowledge the only results currently available on $L_1$ consistency for
multivariate density estimation rely on DPMs of multivariate Gaussian
kernels [33; 29]. Our focus is also on DPMs of Gaussians, but we generalize
the results in [33; 29] to more flexible mixtures that enable scaling to higher 
dimensions using sparse models for the kernel covariance and other modifi- 
cations. 

3.1. Dirichlet process location mixtures. We initially consider DP
location mixtures of multivariate Gaussian kernels, which let

\[ f^*_{P,\Sigma}(y) = \int \phi(y; \theta, \Sigma)\, dP(\theta), \qquad P \sim DP(\alpha P_0), \qquad \Sigma \sim \Pi, \tag{4} \]

with $\phi(\cdot; \theta, \Sigma)$ denoting the multivariate Gaussian density with location
$\theta \in \mathbb{R}^p$ and covariance $\Sigma \in \mathcal{M}_p$, $\alpha > 0$ is a concentration parameter, $P_0$ is a DP
base probability measure, and $\Pi$ is a prior on the space $\mathcal{M}_p$ of $p \times p$ positive
semi-definite matrices. In marginalizing out $P$ one can write the model (4)
as

\[ f^*_{P,\Sigma}(y) = \sum_{k=1}^{\infty} \pi_k \phi(y; \theta_k, \Sigma), \qquad \theta_k \sim P_0, \qquad \pi_k = V_k \prod_{l<k} (1 - V_l), \tag{5} \]

where $V_k \sim \mathrm{beta}(1, \alpha)$ for each $k$.
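
The stick-breaking form of the location mixture can be simulated directly. The sketch below draws the weights $\pi_k = V_k \prod_{l<k}(1 - V_l)$ with $V_k \sim \mathrm{beta}(1, \alpha)$ and evaluates a truncated version of the mixture density. Truncating the infinite sum at a finite number of atoms, the standard normal base measure, the value $\alpha = 1$ and the function names are assumptions made for this illustration.

```python
import numpy as np

def stick_breaking_weights(alpha, n_atoms, rng):
    """First n_atoms stick-breaking weights pi_k = V_k * prod_{l<k} (1 - V_l)."""
    v = rng.beta(1.0, alpha, size=n_atoms)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining

def location_mixture_density(y, weights, locations, cov):
    """Evaluate sum_k pi_k * phi(y; theta_k, Sigma) at the rows of y."""
    n, p = y.shape
    chol = np.linalg.cholesky(cov)
    log_norm = -0.5 * p * np.log(2.0 * np.pi) - np.log(np.diag(chol)).sum()
    diffs = y[:, None, :] - locations[None, :, :]           # shape (n, K, p)
    z = np.linalg.solve(chol, diffs.reshape(-1, p).T)       # whitened residuals
    quad = (z ** 2).sum(axis=0).reshape(n, -1)              # shape (n, K)
    return np.exp(log_norm - 0.5 * quad) @ weights

rng = np.random.default_rng(0)
p, n_atoms, alpha = 2, 50, 1.0
weights = stick_breaking_weights(alpha, n_atoms, rng)
locations = rng.normal(size=(n_atoms, p))   # atoms drawn from an assumed N(0, I) base measure
cov = np.array([[1.0, 0.3], [0.3, 0.5]])    # one common covariance, as in a location mixture
points = rng.normal(size=(4, p))
print(weights.sum(), location_mixture_density(points, weights, locations, cov))
```

Because the sum is truncated, the weights add up to slightly less than one. The leftover mass is the tail sum of the weights, the same kind of quantity that the sieves in this section require to be small.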

The following theorem provides regularity conditions on the true
data-generating density $f_0^*$ to ensure that it falls within the KL support of the
prior $\Pi^*$ on $f^*$ induced by (4). This is a particular case of Theorem 2 of [33],
where they let $P \sim \pi_P$, with $\pi_P$ an arbitrary prior with weak support on
the space of multivariate continuous densities $\mathcal{F}^*$.

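Theorem 3. Let $f_0^* \in \mathcal{F}^*$ be a density over $\mathbb{R}^p$ with respect to Lebesgue
measure and let $\Pi^*$ denote the prior on $f^*$ induced from (4). Assume the
following:

1. $0 < f_0^*(y) < M^*$ for some constant $M^*$ and all $y \in \mathbb{R}^p$;
2. $\left| \int f_0^*(y) \log f_0^*(y)\, dy \right| < \infty$;
3. for some $\delta > 0$, $\int f_0^*(y) \log \frac{f_0^*(y)}{\phi_\delta(y)}\, dy < \infty$, where $\phi_\delta(y) = \inf_{\|t - y\| < \delta} f_0^*(t)$;
4. for some $\eta > 0$, $\int \|y\|^{2(1+\eta)} f_0^*(y)\, dy < \infty$.

Then $f_0^*$ is in the KL support of $\Pi^*$.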

In proving strong posterior consistency in non-compact spaces, such as
$\mathcal{F}^*$, a critical step is to introduce a compact subset $\mathcal{F}_n^*$ that is indexed by
the sample size $n$ and that grows to fill the entire space as $n \to \infty$. This
sequence of subsets is typically referred to as a sieve, and the choice of the
sieve for multivariate density estimation is quite important as standard
sufficient conditions for posterior consistency require the $L_1$ metric entropy of
$\mathcal{F}_n^*$ to grow slower than linearly in $n$, while also requiring the prior
probability assigned outside of $\mathcal{F}_n^*$ (to $\mathcal{F}_n^{*C}$) to decrease exponentially fast in $n$. If the
sieve is not very carefully chosen, $\mathcal{F}_n^*$ may be quite small, so that the
condition on the prior becomes very restrictive and the prior may need to have
very light tails or even compact support. The choice of the sieve is
particularly crucial in multivariate density estimation, since naive choices may lead
the $L_1$ metric entropy to "blow up" with dimension $p$. Lemma 2 proposes
a sieve, which modifies the formulation of [25], and provides a bound on
the $L_1$ metric entropy. This sieve is then utilized in Theorem 4, which gives
sufficient conditions on the prior under which the posterior $\Pi^*(\cdot \mid y_1, \ldots, y_n)$
is $L_1$ consistent. This result is of significant general interest in multivariate
density estimation in relaxing conditions of [33].

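Lemma 2. Let $a$, $l$, $m$ be positive constants, $\lambda_i(\Sigma)$ the $i$th eigenvalue
of $\Sigma$, ordered from the smallest to the largest, and
$\mathcal{F}_{a,l,m} = \{f_{P,\Sigma} : \|\theta_k\| \le a,\ k = 1, \ldots, m,\ \sqrt{\lambda_1(\Sigma)} \ge l,\ \sum_{k>m} \pi_k < \epsilon\}$. Then

\[ J(4\epsilon, \mathcal{F}_{a,l,m}) \le m \log\left\{ d_1 \left( \frac{a}{l} \right)^p + d_2 \right\} + d_3 m \log(d_4 m), \]

where $d_1$, $d_2$, $d_3$ and $d_4$ are constants depending on $\epsilon$.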

In [33] they stress that the usual method of constructing a sieve does not
lead to a consistency result in the multivariate case because of the explosion
of the $L_1$-metric entropy. They propose to include a fixed bound on the
highest eigenvalue of $\Sigma$ through the prior. Lemma 2 introduces a novel sieve
that is particularly designed to bypass this constraint through modifying
the sieve of [25] to be formulated through a lower bound on the smallest
eigenvalue of $\Sigma$.

Theorem 4. Assume we observe an iid sample $y = y_1, \ldots, y_n$ from $f_0^*$
satisfying 1-4 of Theorem 3. Consider the prior $\Pi^*$ defined in (4). For any
$\epsilon > 0$ and $\beta < \epsilon^2/8$, if there exist sequences $a_n = O(\sqrt{n})$ and $l_n = O(1/\sqrt{n})$
and $\beta_1 > 0$ such that the following conditions hold:

5. $\Pi\{\Sigma : \sqrt{\lambda_1(\Sigma)} < l_n\} \le e^{-n\beta_1}$;
6. $P_0\{\|\theta\| > a_n\} \le e^{-n\beta_1}$;
7. $(a_n/l_n)^p < n\beta$;

then $\Pi^*(\{f^* : \|f^* - f_0^*\| < \epsilon\} \mid y_1, \ldots, y_n) \to 1$ a.s. $P_{f_0^*}$.

The proof relies on verifying the conditions of Theorem 8 of [12]. In
particular the first two conditions ensure that the prior probability of the
complement of the sieve is exponentially small while condition 7 determines the
upper bound for the $L_1$-metric entropy of the sieve described in Lemma 2
with $m = O(n/\log(n))$.

3.2. Dirichlet process mean-covariance mixtures. A second mixture model
generalizes model (4) by mixing also the covariance matrix:

\[ f^*_P(y) = \int \phi(y; \theta, \Sigma)\, dP(\theta, \Sigma), \qquad P \sim DP(\alpha P_0), \tag{6} \]

with $P_0$ being now a measure on $\mathbb{R}^p \times \mathcal{M}_p$. In marginalizing out $P$ one can
write the model (6) as

\[ f^*_P(y) = \sum_{k=1}^{\infty} \pi_k \phi(y; \theta_k, \Sigma_k), \qquad (\theta_k, \Sigma_k) \sim P_0, \qquad \pi_k = V_k \prod_{l<k} (1 - V_l), \tag{7} \]

where $V_k \sim \mathrm{beta}(1, \alpha)$ for each $k$.

Theorem 5. Let $f_0^*$ be a density function on $\mathbb{R}^p$ satisfying 1-4 of
Theorem 3. If $\Pi^*$ is the prior induced by (6) then $f_0^*$ is in the KL support of $\Pi^*$.



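Proof. The proof follows Theorem 2 of [33] in bounding the density of
a multivariate normal density with general covariance matrix $\Sigma$ by

\[ \left( \frac{\lambda_1(\Sigma)}{\lambda_p(\Sigma)} \right)^{(p-1)/2} \phi(y; 0, \lambda_1(\Sigma) I_p) \le \phi(y; 0, \Sigma) \le \left( \frac{\lambda_p(\Sigma)}{\lambda_1(\Sigma)} \right)^{(p-1)/2} \phi(y; 0, \lambda_p(\Sigma) I_p). \]

Then by Theorems 2 and 5 of [32], for any $\epsilon > 0$ we have an open set $V$
such that, for any $P \in V$, we have

\[ \int f_0^*(y) \log \frac{f_0^*(y)}{\int \phi(y; \theta, \omega I_p)\, dP(\theta, \omega)}\, dy < \frac{\epsilon}{2}. \]

The rest of the proof follows from [33]. □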
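
The eigenvalue sandwich used in this proof, $(\lambda_1/\lambda_p)^{(p-1)/2} \phi(y; 0, \lambda_1 I_p) \le \phi(y; 0, \Sigma) \le (\lambda_p/\lambda_1)^{(p-1)/2} \phi(y; 0, \lambda_p I_p)$ with $\lambda_1$ and $\lambda_p$ the smallest and largest eigenvalues of $\Sigma$, can be checked numerically. The following Python sketch does so for randomly generated covariance matrices; the function names and the way the test matrices are generated are assumptions made for the illustration.

```python
import numpy as np

def gaussian_pdf(y, cov):
    """Zero-mean multivariate normal density phi(y; 0, cov)."""
    p = len(y)
    _, logdet = np.linalg.slogdet(cov)
    quad = y @ np.linalg.solve(cov, y)
    return np.exp(-0.5 * (p * np.log(2.0 * np.pi) + logdet + quad))

def sandwich(y, cov):
    """Return (lower bound, phi(y; 0, cov), upper bound) from the eigenvalue bounds."""
    p = len(y)
    eig = np.linalg.eigvalsh(cov)           # ascending order
    lam_1, lam_p = eig[0], eig[-1]          # smallest and largest eigenvalues
    lower = (lam_1 / lam_p) ** ((p - 1) / 2) * gaussian_pdf(y, lam_1 * np.eye(p))
    upper = (lam_p / lam_1) ** ((p - 1) / 2) * gaussian_pdf(y, lam_p * np.eye(p))
    return lower, gaussian_pdf(y, cov), upper

rng = np.random.default_rng(1)
for p in (2, 3, 5):
    a = rng.normal(size=(p, p))
    cov = a @ a.T + 0.1 * np.eye(p)         # a generic positive definite matrix
    lower, value, upper = sandwich(rng.normal(size=p), cov)
    print(p, lower <= value <= upper)
```

Both bounds are densities of spherical Gaussians up to a constant, which is what allows results for spherical kernels to be transferred to a general covariance.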

The next lemma describes the sieve and its size in terms of $L_1$ metric
entropy. It is a generalization of the sieve defined in Lemma 2. The proof is
reported in the Appendix.

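Lemma 3. Let $a$, $h$, $l$, $m$ be positive constants and
$\mathcal{F}_{a,h,l,m} = \{f_P : \|\theta_k\| \le a,\ \sqrt{\lambda_1(\Sigma_k)} \ge l,\ \sqrt{\lambda_p(\Sigma_k)} \le h,\ k = 1, \ldots, m,\ \sum_{k>m} \pi_k < \epsilon\}$. Then

\[ J(4\epsilon, \mathcal{F}_{a,h,l,m}) \le m \log\left\{ d_1 \left( \frac{a}{l} \right)^p + d_2 \log\frac{h}{l} + d_3 \right\} + d_4 m \log(d_5 m), \]

where $d_1$, $d_2$, $d_3$, $d_4$ and $d_5$ are constants depending on $\epsilon$.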

The next theorem gives the conditions for $L_1$ posterior consistency under
the Dirichlet process mixture model (6).

Theorem 6. Assume we observe an iid sample $y = y_1, \ldots, y_n$ from $f_0^*$
satisfying 1-4 of Theorem 3. Consider the prior $\Pi^*$ defined in (6). For any
$\epsilon > 0$ and $\beta < \epsilon^2/8$, if there exist sequences $a_n = O(\sqrt{n})$, $h_n = O(\exp(n))$
and $l_n = O(1/\sqrt{n})$ and $\beta_1 > 0$ such that the following conditions hold:

8. $P_0\{\|\theta\| > a_n,\ \sqrt{\lambda_1(\Sigma)} < l_n,\ \sqrt{\lambda_p(\Sigma)} > h_n\} \le e^{-n\beta_1}$;
9. $(a_n/l_n)^p < n\beta$, $\log(h_n/l_n) < n\beta$;

then $\Pi^*(\{f^* : \|f^* - f_0^*\| < \epsilon\} \mid y_1, \ldots, y_n) \to 1$ a.s. $P_{f_0^*}$.

Here too, the proof consists of verifying that the conditions of Theorem
8 of [12] are satisfied, using the sieve construction of Lemma 3 with
$m = O(n/\log(n))$.

3.3. Examples. The results of the previous subsections rely on Dirichlet
process mixtures of multivariate Gaussian kernels, but the results are
nonetheless quite broad in allowing general priors for the covariance $\Sigma \sim \Pi$
in the location mixture case and general choices of base measure $P_0$ in the
location-covariance mixture case. These choices make a substantial practical
difference in applications, particularly in high-dimensional settings. In
what follows, we show that posterior consistency can be obtained for some
particular potentially useful cases of $\Pi$ and $P_0$.

A common and convenient prior for the covariance matrix is
$\Sigma^{-1} \sim W(\Sigma_0, r)$. Corollary 2 of [33] shows that a truncated Wishart has
exponentially small probability on the complement of the sieve introduced therein.
This artificial truncation of the Wishart, which may lead to hurdles in
implementation, is no longer necessary under our sieve construction, and
we can let $\Pi = IW(\Sigma_0, r)$ in the setting of Section 3.1 and
$P_0(\theta, \Sigma) = N(\theta; \theta_0, \Omega_0) \times IW(\Sigma; \Sigma_0, r)$ in the setting of Section 3.2. The next corollary
formalizes the consistency of the two above mentioned priors showing that
they satisfy the conditions of Theorems 4 and 6. The proofs are reported in
the Appendix.

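Corollary 1. Assume we observe an iid sample $y_1, \ldots, y_n$ from $f_0^*$
satisfying the conditions 1-4 in Theorem 3. Consider a prior $\Pi^*$ induced by
(4) with $P_0 = N(\theta_0, \Omega_0)$ and $\Sigma^{-1} \sim W(\Sigma_0, r)$ (by (6) with
$P_0 = N(\theta_0, \Omega_0) \times IW(\Sigma_0, r)$) with $\Sigma_0 = \sigma_0 I_p$. Then for any $\epsilon > 0$,
$\Pi^*(\{f^* : \|f^* - f_0^*\| < \epsilon\} \mid y_1, \ldots, y_n) \to 1$ a.s. $P_{f_0^*}$.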

Remark 1. The statement of Corollary 1 is true also if we consider the
prior induced by (6) with $P_0(\theta, \Sigma) = N(\theta_0, \tau_0 \Sigma)\, IW(\Sigma_0, r)$, since $P_0\{\|\theta\| > a_n\}$
is exponentially small.

It is well known that an inverse Wishart prior for the covariance tends to
provide a poor choice when the dimension $p$ of the data is large, even in
the simpler parametric setting involving one multivariate Gaussian kernel.
To address this problem, there is a rich literature on shrinkage priors for
covariance matrices, with a commonly used and successful approach relying
on factor analytic factorizations in which

\[ \Sigma = \Gamma\Gamma^T + \Omega, \qquad \Gamma \sim \Pi_\Gamma, \qquad \Omega \sim \Pi_\Omega, \tag{8} \]

where $\Gamma$ is a $p \times r$ matrix and $\Omega$ is a $p \times p$ diagonal matrix. For example, [2]
induces a prior for a covariance matrix through (8) with $\Pi_\Gamma$ and $\Pi_\Omega$ induced
through letting

\[
\begin{aligned}
&\gamma_{jh} \mid \phi_{jh}, \tau_h \sim N(0, \phi_{jh}^{-1} \tau_h^{-1}), \qquad \phi_{jh} \sim \mathrm{Ga}(3/2, 3/2), \qquad \tau_h = \prod_{l=1}^{h} \delta_l, \\
&\delta_1 \sim \mathrm{Ga}(a_1, 1), \qquad \delta_l \sim \mathrm{Ga}(a_2, 1),\ l > 1, \qquad \sigma_j^{-2} \sim \mathrm{Ga}(a, b),\ j = 1, \ldots, p,
\end{aligned}
\tag{9}
\]

where $\gamma_{jh}$ is the element in row $j$ and column $h$ of $\Gamma$ and $\sigma_j^2$ is the $j$th
diagonal element of $\Omega$. The next corollary provides conditions on $\Pi_\Gamma$ and
$\Pi_\Omega$ for $L_1$ posterior consistency in DP mixtures of multivariate normal factor
models.
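
To make the factor-analytic construction concrete, the sketch below draws one covariance matrix of the form $\Sigma = \Gamma\Gamma^T + \Omega$ under a multiplicative gamma prior of the kind described above. It assumes a shape-rate parameterization of the gamma distributions; the hyperparameter values, the number of factors and the function name are hypothetical choices for illustration and are not recommendations from the paper.

```python
import numpy as np

def sample_factor_covariance(p, r, a1, a2, a_sigma, b_sigma, rng):
    """Draw Sigma = Gamma Gamma^T + Omega with shrinking loadings.

    Gamma distributions are taken in shape-rate form, so a rate b becomes
    scale 1 / b in NumPy's parameterization.
    """
    phi = rng.gamma(shape=1.5, scale=1.0 / 1.5, size=(p, r))        # local precisions phi_jh
    delta = np.concatenate((rng.gamma(a1, 1.0, size=1),
                            rng.gamma(a2, 1.0, size=r - 1)))        # delta_1, delta_2, ...
    tau = np.cumprod(delta)                                         # column precisions tau_h
    loadings = rng.normal(size=(p, r)) / np.sqrt(phi * tau)         # gamma_jh given phi, tau
    sigma2 = 1.0 / rng.gamma(a_sigma, 1.0 / b_sigma, size=p)        # residual variances
    return loadings @ loadings.T + np.diag(sigma2), loadings, sigma2

rng = np.random.default_rng(0)
cov, loadings, sigma2 = sample_factor_covariance(
    p=6, r=3, a1=2.0, a2=3.0, a_sigma=2.0, b_sigma=1.0, rng=rng)
print(np.linalg.eigvalsh(cov).min(), sigma2.min())
```

The smallest eigenvalue of the sampled matrix is never below the smallest residual variance, which is why a tail condition on the residual variances alone is enough to control the smallest eigenvalue of $\Sigma$.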

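Corollary 2. Assume we observe an iid sample $y_1, \ldots, y_n$ from $f_0^*$
satisfying the conditions 1-4 in Theorem 3. Consider a prior $\Pi^*$ defined in
(4) with $P_0 = N(\theta_0, \Omega_0)$ (in (6) with $P_0 = N(\theta_0, \Omega_0) \times \Pi$) and $\Pi$ following
the factor model representation of (8). Further assume that

10. $E\{\mathrm{tr}(\Gamma\Gamma^T)\} < \infty$, $E\{\mathrm{tr}(\Omega)\} < \infty$;
11. $\Pi_\Omega\{\sigma_j^2 < l_n^2\} \le e^{-cn}$ for some constant $c > 0$, with $l_n = O(1/\sqrt{n})$.

Then for any $\epsilon > 0$, $\Pi^*(\{f^* : \|f^* - f_0^*\| < \epsilon\} \mid y_1, \ldots, y_n) \to 1$ a.s. $P_{f_0^*}$.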

Remark 2. The particular sparse factor model of [2] shown in (9)
satisfies conditions 10 and 11. In fact $\mathrm{tr}(\Gamma\Gamma^T)$ and $\mathrm{tr}(\Omega)$ have expectations
equal to pb/{a + l) and pYYi^^-^^aial^'^ respectively since they are distributed 
according to 

Mrr^) ~ E E <^Jh^"-h'^'xI Hn) = of, a] ~ /Ga(a, b). 

h=l 0=1 3=1 

Furthermore IlQ{a'j < 1^) = F(a, c//^)/F(a) ~ e~'^"' for some constant c > 0. 

4. Appendix.

Proof of Theorem 1. The next two lemmas are useful to determine
the size of the parameter space, measured in terms of $L_1$ metric entropy.
The first shows that the $L_1$ topology is maintained under the mapping $g$ and
the second bounds the $L_1$ metric entropy of a sieve.

Lemma 4. Assume that the true data generating density is $f_0 \in \mathcal{F}$.
Choose any $f_0^*$ such that $f_0 = g(f_0^*)$. Let $U(f_0^*) = \{f^* : \|f_0^* - f^*\| < \epsilon\}$ be
an $L_1$ neighborhood of size $\epsilon$ around $f_0^*$. Then the image $g(U(f_0^*))$ contains
values $f \in \mathcal{F}$ in an $L_1$ neighborhood of $f_0$ of at most size $\epsilon$.

The proof is omitted since it follows directly from the definition of $L_1$
neighborhood and from Fubini's theorem.

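Lemma 5. Let $\mathcal{F}_n^* \subset \mathcal{F}^*$ denote a compact subset of $\mathcal{F}^*$, with $J(\delta, \mathcal{F}_n^*)$ the
$L_1$ metric entropy corresponding to the logarithm of the minimum number of
$\delta$-sized $L_1$ balls needed to cover $\mathcal{F}_n^*$. Letting $\mathcal{F}_n = g(\mathcal{F}_n^*)$, we have
$J(\delta, \mathcal{F}_n) \le J(\delta, \mathcal{F}_n^*)$.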

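Proof of Lemma 5. Let $k = \exp\{J(\delta, \mathcal{F}_n^*)\}$ be the number of $\delta$ balls
needed to cover $\mathcal{F}_n^*$, with $f_1^*, \ldots, f_k^*$ denoting the centers of these balls so
that $\mathcal{F}_n^* \subset \bigcup_{i=1}^{k} \mathcal{F}_{n,i}^*$, where $\mathcal{F}_{n,i}^* = \{f^* : \|f^* - f_i^*\| < \delta\}$. From Lemma 4,
it is clear we can define $\mathcal{F}_n \subset \bigcup_{i=1}^{k} \mathcal{F}_{n,i}$, where $\mathcal{F}_{n,i} = g(\mathcal{F}_{n,i}^*)$ is an $L_1$
neighborhood around $f_i = g(f_i^*)$ of size at most $\delta$. This defines a covering of
$\mathcal{F}_n$ using $k$ $\delta$-sized $L_1$ balls, but this is not necessarily the minimal covering
possible and hence $J(\delta, \mathcal{F}_n^*)$ provides an upper bound on $J(\delta, \mathcal{F}_n)$. □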

The rest of the proof follows along almost the same lines of [12] in showing
that the sets $\mathcal{F}_n \cap \{f : \|f - f_0\| < \epsilon\}$ and $\mathcal{F}_n^C$ satisfy the conditions of an
unpublished result of Barron (see Theorem 4.4.3 of [16]). □

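Proof of Theorem 2. Let $\mathcal{F}_n = g(\mathcal{F}_n^*)$. From Lemma 5 we have
$J(\delta, \mathcal{F}_n) \le J(\delta, \mathcal{F}_n^*)$. Let $D(\epsilon, \mathcal{F})$ be the $\epsilon$-packing number of $\mathcal{F}$, i.e. the maximal number
of points in $\mathcal{F}$ such that the distance between every pair is at least $\epsilon$. For
every $\epsilon > \epsilon_n$, using (iii) we have

\[ \log D(\epsilon/2, \mathcal{F}_n) \le \log D(\epsilon_n, \mathcal{F}_n) \le C n \epsilon_n^2. \]

Therefore applying Theorem 7.1 of [13] with $j = 1$, $D(\epsilon) = \exp(n\epsilon_n^2)$ and
$\epsilon = M\epsilon_n$ with $M > 2$, there exists a sequence of tests $\{\Phi_n\}$ that satisfies

\[ E_{f_0}\{\Phi_n\} \le \exp\{-(KM^2 - 1) n \epsilon_n^2\}, \qquad \sup_{f \in U^C \cap \mathcal{F}_n} E_f\{1 - \Phi_n\} \le \exp\{-K n M^2 \epsilon_n^2\}. \tag{10} \]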
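
The posterior probability assigned to $U^C$ can be written as

\[ \Pi(U^C \mid y_1, \ldots, y_n) = \Phi_n\, \Pi(U^C \mid y_1, \ldots, y_n) + (1 - \Phi_n) \frac{\int_{U^C} \prod_{i=1}^{n} \frac{f(y_i)}{f_0(y_i)}\, d\Pi(f)}{\int \prod_{i=1}^{n} \frac{f(y_i)}{f_0(y_i)}\, d\Pi(f)}. \]

Taking $KM^2 > K + 1$, the first summand is bounded in expectation by
$2\exp\{-Kn\epsilon_n^2\}$ by (10). The rest of the proof consists in proving that the
remaining term goes to zero in $P_{f_0}$-probability. By Fubini's theorem and (10) we have

\[ E_{f_0}\left\{ (1 - \Phi_n) \int_{U^C \cap \mathcal{F}_n} \prod_{i=1}^{n} \frac{f(y_i)}{f_0(y_i)}\, d\Pi(f) \right\} \le \sup_{f \in U^C \cap \mathcal{F}_n} E_f\{1 - \Phi_n\} \le \exp\{-K n M^2 \epsilon_n^2\}, \]

while by (iv) we have

\[ E_{f_0}\left\{ \int_{\mathcal{F}_n^C} \prod_{i=1}^{n} \frac{f(y_i)}{f_0(y_i)}\, d\Pi(f) \right\} \le \Pi(\mathcal{F}_n^C) = \Pi^*(\mathcal{F}_n^{*C}) \le \exp\{-n\epsilon_n^2 (C + 4)\}. \]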

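The numerator of the second summand is hence exponentially small for
$M \ge \sqrt{(C + 4)/K}$. Finally we need to lower bound the denominator. Clearly

\[ g(B_n^*) \subset B_n = \left\{ f : \int f_0 \log(f_0/f)\, d\mu \le \epsilon_n^2,\ \int f_0 (\log(f_0/f))^2\, d\mu \le \epsilon_n^2 \right\} \]

and then $\Pi(B_n) \ge \Pi(g(B_n^*)) = \Pi^*(B_n^*)$ and using condition (v) on $\Pi^*(B_n^*)$
we have

\[
\begin{aligned}
\int_{B_n} \int f_0 \log(f_0/f)\, d\mu\, d\Pi(f) &\le \int_{B_n} \epsilon_n^2\, d\Pi(f), \\
\int_{B_n} \int f_0 (\log(f_0/f))^2\, d\mu\, d\Pi(f) &\le \int_{B_n} \epsilon_n^2\, d\Pi(f).
\end{aligned}
\]

Then using Lemma 8.1 of [13] we obtain

\[ \int \prod_{i=1}^{n} \frac{f(y_i)}{f_0(y_i)}\, d\Pi(f) \ge \exp\{-n\epsilon_n^2 (C + 4)\}, \]

which concludes the proof. □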

Proof of Lemma 3. The proof is similar to Theorem 7.5 of [25] while
dealing with the generalization of the mixture model introduced in (6). For
any $f_1, f_2 \in \mathcal{F}_{a,h,l,m}$ we have



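\[ \|f_1 - f_2\| \le \sum_{k=1}^{m} \pi_k^{(1)} \left\| \phi_{\theta_k^{(1)}, \Sigma_k^{(1)}} - \phi_{\theta_k^{(2)}, \Sigma_k^{(2)}} \right\| + \sum_{k=1}^{m} \left| \pi_k^{(1)} - \pi_k^{(2)} \right| + 2\epsilon. \]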



We are going to give the upper bound for the sieve using the usual steps
as in [12], [30] and [33]. We start by showing that two single multivariate
normal kernels with suitable parameters have $L_1$ distance smaller than $\epsilon$. It
can be shown that for any two multivariate normals with the same mean
vector and with $\det(\Sigma_1) < \det(\Sigma_2)$ we have



1 1 '/'El -'/'Sail < y 



det(S2 



a/2 



det(Si)i/2 



dx 



det(S5 



\l/2 



det(i;] 



U/2 



det(Si 



)l/2 



Hence 



+ 



< 



2 11^1-02 

^AT(Sl 



+ 



|det(S2 



a/2 



det(Si)i/2 



det(Si)i/2 



where the first summand can be obtained following the first part of the proof 
of Lemma 5 in [33] and the second follows from above. Let = min(e/2, 1). 
Define = /^(l + C)*"; m > 0. Let M be the smallest integer such that 
F{\ + C)*^ > h^- This clearly implies M < + C)"Mog(/i/0 + 1. For 
1 < j < M, let Nj = [^a/(eA]/^J] . For 1 < i < Nj; l<j<M, define 



-a + 



2a{i - 1) 



-a + 



2ai 



X (A,-_i,A,]. 



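Then for $(\theta_1, \Sigma_1)$ and $(\theta_2, \Sigma_2) \in \{(\theta, \Sigma) : (\theta, \det(\Sigma)) \in E_{ij}\}$ we have that
$\|\phi_{\theta_1, \Sigma_1} - \phi_{\theta_2, \Sigma_2}\| < \epsilon$. Let $N$ be the minimum number of $\epsilon$-balls needed to cover
$\Theta_{a,l,h} = \{\phi_{\theta, \Sigma} : \|\theta\| \le a,\ \sqrt{\lambda_1(\Sigma)} \ge l,\ \sqrt{\lambda_p(\Sigma)} \le h\}$. Clearly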




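Let $G_m = \{\pi^m = (\pi_1, \ldots, \pi_m)\}$. Fix $\pi^{(1)}$ and $\pi^{(2)} \in G_m$. Let, for $k = 1, 2$,
$V_h^{(k)} = \pi_h^{(k)} / (1 - \sum_{l<h} \pi_l^{(k)})$. Clearly $\sum_{h=1}^{m} |\pi_h^{(1)} - \pi_h^{(2)}| < \epsilon$ if for each
$h = 1, \ldots, m$, $|V_h^{(1)} - V_h^{(2)}| < \epsilon/m^2$. Since $V_h^{(1)}, V_h^{(2)} \in [0, 1]$, the number of $\epsilon$-balls
required to cover $G_m$ is $(m^2/\epsilon)^m$ times a constant. Hence

\[ J(4\epsilon, \mathcal{F}_{a,h,l,m}) \le m \log\left\{ d_1 \left( \frac{a}{l} \right)^p + d_2 \log\frac{h}{l} + d_3 \right\} + d_4 m \log(d_5 m). \]

□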



Proof of Corollary 1. We prove the corollary in its bracketed version,
that is, for the prior induced by (6). We need to show that there exist sequences
$\{a_n\}$, $\{h_n\}$ and $\{l_n\}$ such that conditions 8-9 are satisfied. Let $a_n = C_1\sqrt{n}$,
$l_n = C_2/\sqrt{n}$ and $h_n = C_3 \exp(n)$ such that $\log(C_3/C_2) < (C_1/C_2)^p < \beta < \epsilon^2/8$. Condition 9
is obviously satisfied.

In order to show that condition 8 is satisfied we bound separately $P_0\{\|\theta\| > a_n\}$,
$P_0\{\sqrt{\lambda_1(\Sigma)} < l_n\}$ and $P_0\{\sqrt{\lambda_p(\Sigma)} > h_n\}$. Note that to shorten the
notation we are using $P_0\{\theta < a\}$ and $P_0\{\sqrt{\lambda_i(\Sigma)} < a\}$ for $P_0\{(\theta, \Sigma) : \theta < a\}$
and $P_0\{(\theta, \Sigma) : \sqrt{\lambda_i(\Sigma)} < a\}$. Since $\theta \sim N(\theta_0, \Omega_0)$ we have, for some
constant $d > 0$, $P_0\{\|\theta\| > a_n\} \le d\exp(-a_n^2) \le d\exp(-C_1^2 n)$. Then

\[ P_0\{\sqrt{\lambda_1(\Sigma)} < l_n\} = P_0\{\lambda_p(\Sigma^{-1}) > l_n^{-2}\} \le P_0\{\mathrm{tr}(\Sigma^{-1}) > C_2^{-2} n\}. \]

By definition of a Wishart distribution we have that $\mathrm{tr}(\Sigma^{-1}) \sim \sigma_0 \chi^2_{pr}$ and
since the $\chi^2$ has exponential tail, we have

\[ P_0\{\sqrt{\lambda_1(\Sigma)} < l_n\} \le P_0\{\mathrm{tr}(\Sigma^{-1}) > C_2^{-2} n\} \sim e^{-cn}. \]

Finally

\[ P_0\{\sqrt{\lambda_p(\Sigma)} > h_n\} \le P_0\{\mathrm{tr}(\Sigma) > h_n^2\} \le h_n^{-2} E_{P_0}\{\mathrm{tr}(\Sigma)\} = h_n^{-2}\, \mathrm{tr}(E_{P_0}\{\Sigma\}) \]

by Markov's inequality. Since $\Sigma$ is inverse Wishart distributed its expectation
is a matrix with finite entries and hence its trace is finite. It
follows that $P_0\{\sqrt{\lambda_p(\Sigma)} > h_n\} \sim e^{-2n}$, which concludes the proof. □

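Proof of Corollary 2. We prove the corollary in its bracketed version,
that is, for the prior induced by (6). Let $\{a_n\}$, $\{h_n\}$ and $\{l_n\}$ be as in the proof of Corollary 1 so
that condition 9 is satisfied. $P_0\{\|\theta\| > a_n\}$ is exponentially small. Then

\[ P_0\{\sqrt{\lambda_1(\Sigma)} < l_n,\ \sqrt{\lambda_p(\Sigma)} > h_n\} \le P_0\{\sqrt{\lambda_1(\Sigma)} < l_n\} + P_0\{\sqrt{\lambda_p(\Sigma)} > h_n\}. \]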

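We start with the condition on the smallest eigenvalue. Since
$\lambda_1(\Sigma) \ge \lambda_1(\Omega) + \lambda_1(\Gamma\Gamma^T)$, we have

\[ P_0\{\sqrt{\lambda_1(\Sigma)} < l_n\} \le P_0\{\lambda_1(\Omega) + \lambda_1(\Gamma\Gamma^T) < l_n^2\} \le \Pi_\Omega\{\lambda_1(\Omega) < l_n^2\}. \]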



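Then

\[
\begin{aligned}
\Pi_\Omega\{\lambda_1(\Omega) < l_n^2\} &= \Pi_\Omega\left\{ \min_j \sigma_j^2 < l_n^2 \right\} \\
&= 1 - \Pi_\Omega\{\sigma_j^2 \ge l_n^2\}^p \\
&\le p\, \Pi_\Omega\{\sigma_j^2 < l_n^2\} \le p\, e^{-cn}
\end{aligned}
\]

by condition 11. We show now the condition on the highest eigenvalue: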

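\[ P_0\{\sqrt{\lambda_p(\Sigma)} > h_n\} = P_0\{\lambda_p(\Sigma) > h_n^2\} \le P_0\{\mathrm{tr}(\Sigma) > h_n^2\} \le h_n^{-2} E_{P_0}\{\mathrm{tr}(\Sigma)\} \]

by Markov's inequality. Since $E\{\mathrm{tr}(\Sigma)\} = E\{\mathrm{tr}(\Gamma\Gamma^T) + \mathrm{tr}(\Omega)\} = E_{\Pi_\Gamma}\{\mathrm{tr}(\Gamma\Gamma^T)\} + E_{\Pi_\Omega}\{\mathrm{tr}(\Omega)\}$
with both the expectations finite by condition 10, we obtain

\[ P_0\{\sqrt{\lambda_p(\Sigma)} > h_n\} \le C h_n^{-2} \le e^{-cn}, \]

which concludes the proof. □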